Report No. 363


THE  USE  AND  PERFORMANCE 
OF  MEMORY  HIERARCHIES:  A  SURVEY 

by 

D.  J.  Kuck 
D.  H.  Lawrie 


December 4, 1969




DEPARTMENT  OF  COMPUTER  SCIENCE 
UNIVERSITY  OF  ILLINOIS  AT  URBANA-CHAMPAIGN 


URBANA,  ILLINOIS 



Department  of  Computer  Science 

University of Illinois at Champaign-Urbana

Urbana,  Illinois  61801 


TABLE OF CONTENTS

I. Introduction

II. Page Fault Rate

    2.1 EFFECT OF PRIMARY MEMORY ALLOTMENT ON PAGE FAULT RATE

    2.2 EFFECT OF PAGE SIZE AND PRIMARY MEMORY ALLOTMENT ON PAGE FAULT RATE

        2.2.1 FRAGMENTATIONS AND PAGE SIZE

        2.2.2 SUPERFLUITY VS. PAGE SIZE

        2.2.3 PRIMARY MEMORY ALLOTMENT AND PAGE SIZE

    2.3 REPLACEMENT ALGORITHMS

    2.4 PROGRAM ORGANIZATION

    2.5 SUMMARY

III. Multiprogramming

IV. Average Time Per I/O Request

    4.1 PHYSICAL LATENCY OF SECONDARY MEMORY

    4.2 EFFECTIVE LATENCY OF SECONDARY MEMORY

    4.3 REQUEST QUEUEING

    4.4 MINIMIZATION OF EXPLICIT I/O REQUEST TIME

V. Summary and Extensions

LIST OF FOOTNOTES

BIBLIOGRAPHY


LIST OF FIGURES

Figure

1.  Mean time to reference p pages as a function of p.

2a. E vs. (p,T) surface for $q = 2 \times 10^{?}$ [exponent illegible],
    $\alpha = 3.8$, $\beta = 2.4$.

2b. E vs. (p,T) surface for $q = 5 \times 10^{3}$, $\alpha = 3.8$,
    $\beta = 2.4$.

3.  Memory fragmentation with four pages of size $b_1 = 4Q$, $b_2 = 1.5Q$,
    $b_3 = 3.2Q$ and $b_4 = 4Q$. $B = 4Q$.

4a. Page fault rate $\lambda$ as a function of primary allotment m and
    page size B. Data for a FORTRAN compiler is from Anacker and
    Wang [4]. Note B scale is logarithmic.

4b. Page fault rate $\lambda$ vs. m and B. Data for a SNOBOL compiler
    from Varian and Coffman [135]. Note $\lambda$ scale is different than
    Figure 4a. Dashed lines indicate locus of equal $\lambda$.

5a. CPU efficiency as a function of the number of jobs J and
    average I/O completion time T. Average page rate is
    $1/(3.8\,([\text{illegible}]/J)^{2.4})$ and explicit I/O interrupts
    occur every 10K instructions on the average.

5b. CPU efficiency as a function of J and T. Average page rate
    is $1/(3.8\,([\text{illegible}]/J)^{2.4})$ and explicit I/O interrupts
    occur every [illegible] instructions on the average.

5c. CPU efficiency as a function of J and T. Average page rate
    is $1/(3.8\,(32/J)^{2.4})$ and explicit I/O interrupts occur every
    10K instructions on the average.

6.  Relative gain G in efficiency over monoprogramming for optimal
    number of jobs vs. average I/O completion time (normalized).
    $\alpha = 3.8$, $\beta = 2.4$. Numbers on curves indicate optimal
    number of jobs.




LIST OF TABLES

Table

I. Summary of Results from Varian and Coffman [135].


I.  Introduction 

The  fundamental  reason  for  using  memory  hierarchies  in  computer 
systems  is  to  reduce  the  system  cost.  System  designers  must  balance  the 
system  cost  savings  accruing  from  a  memory  hierarchy  against  the  system 
performance  degradation  sometimes  caused  by  the  hierarchy.   Since  modern 
computers  are  being  used  for  a  great  variety  of  applications  in  diverse 
user  environments,  the  hardware  and  software  systems  engineers'  task  is 
becoming  quite  complex.  In  this  paper  we  shall  discuss  a  number  of  the 
hardware  and  software  elements  of  a  memory  hierarchy  in  a  computer  system. 
Included  are  several  models  and  attempts  at  optimization. 

Computer  engineers  may  choose  from  a  number  of  optimization 
criteria  in  designing  a  computer  system.  Examples  are  system  response 
time,  system  cost,  and  central  processing  unit  (CPU)  utilization.  We 
shall  primarily  discuss  CPU  utilization  and  then  relate  this  to  system 
cost.   Such  considerations  as  interrupt  hardware  and  scheduling  algorithms 
determine  response  time  and  are  outside  the  scope  of  this  paper. 

In order to discuss CPU utilization, let us list a number of
reasons for non-utilization of the CPU. That is, assuming that a user
or system program is being executed by the CPU, what may be the causes
of subsequent CPU idleness?

1) The computation is completed.

2) A user program generates an internal interrupt due
   to, e.g., an arithmetic fault.

3) A user program generates an explicit I/O request to
   secondary storage.

4) The system generates an internal interrupt due to,
   e.g., a page fault.

5) The system generates a timer interrupt.

6) The system receives an external interrupt from, e.g.,
   a real time device.

We are using "system" here to mean hardware, firmware, or software.
Point 1) will be implicitly included in some of our discussions by
assuming a distribution of execution times. Point 2) will not be
discussed. Point 3) will be discussed in some detail and point 4)
will be given a thorough discussion. Points 5) and 6) fall under
system response time and will not be explicitly discussed.

If a program (instructions and data) is being executed, let
us define a page fault to be the generation by the system of an address
outside the machine's primary memory. This leads to the generation by
the system of an I/O request to the secondary memory. Now we can
describe the CPU idle time for both points 3) and 4) above by

CPU I/O idle time =
number of I/O requests x average time per I/O request

In this equation, "average time per I/O request" is the interval from
when an I/O request occurs until some user program is again started.
Notice that we are including both the case of explicit, user initiated
I/O requests and the case of implicit, system generated page faults which
lead to I/O requests to the secondary memory. Much of our discussion
will be centered on the minimization of one or the other of the terms on
the right hand side of this equation.

It should be observed that this equation holds for multiprogrammed
as well as monoprogrammed systems. In a monoprogrammed system, the "average
time per I/O request" is defined as the interval from when an I/O request
occurs for some program until that program is again started. We regard
the execution of operating system instructions as CPU idle time. In a
multiprogramming situation, the average time per I/O request is decreased
by allowing several users to interleave their I/O requests, and we shall
also deal with this case.

II.   Page  Fault  Rate 

In this section we will deal with the first term on the right
hand side of the equation of Section I. In particular, we will restrict
our attention to the rate of generation of page fault I/O requests, explicit
I/O requests being ignored. We consider only demand paging, where one page
at a time is obtained from secondary memory.

2.1  EFFECT  OF  PRIMARY  MEMORY  ALLOTMENT  ON  PAGE  FAULT  RATE 

Obviously, the page fault rate will be zero if all of a program's
instructions and data are allowed to occupy primary memory. On the other
hand, it has been demonstrated that a small memory allotment can lead to
disastrous paging rates. The relationship between primary memory allotment
and page faults has been studied by a number of workers [12, 40, 41, 95,
109, 125, 127, 128, 132] and many experiments have been conducted
to determine program paging behavior [4, 9, 11, 18, 27, 55, 62, 95, 108,
111, 133, 135].

One of the statistics which is of interest is the length of the
average execution burst. We will define an execution burst $\varphi$ to be
the number of instructions executed by a program between its successive page
faults. The average execution burst is measured by allowing a program
an initial allotment of pages $p_0$ (usually 1 page) and then allowing the
program to accumulate more pages of memory until it acquires $p_m$ pages.
At this point, any new pages required by a program must be swapped for
pages already in primary memory. In addition, an upper bound, q, on the
total number of instructions executed is sometimes imposed on the program.
This q may be thought of as a time slice or as any condition which causes
the program to be swapped out of primary memory.

Fine, et al. [55] present the results of experiments for
$p_0 = 1$, $p_m = \infty$, $q < 8 \times 10^{?}$ [exponent illegible], and a
page size of 1024 words, which indicate that almost [illegible]9% of all
execution bursts were less than 20.² However, these data include the
results of explicit I/O, and it was assumed that all of a program's pages
were swapped out of primary memory when an explicit I/O request was made.³
This would tend to lower the average $\varphi$, since $\varphi$ would
include the effects of a lot of short execution bursts which occur when
a program is trying to acquire a sufficient working set of pages.⁴

Varian and Coffman [135] also presented this type of data, and their
results are broken down by program and by instructions vs. data for several
values of $p_m$. These results are summarized in Table 1. They apparently
do not include the results of explicit I/O, and are based on a page size
of 1024 words. These statistics indicate, as did those of Fine, et al.,
that execution bursts tend to be quite small. They also indicate that the
average execution burst is quite sensitive to $p_m$, as we might expect.

Another statistic in which we are interested is the mean time
required to reference p distinct pages, $t_p = f(p)$. Data cited by Fine,
et al. indicate that the shape of the $t_p = f(p)$ curve is as shown in
Figure 1. Fine's data indicate that a significant number of pages are
referenced within a very short time. For example, in half of the cases
measured, the first ten pages were required within 500 instructions
(median). This has obvious implications for any virtual memory system
which does not have sufficient primary memory.

Shemer and Shippey [127] show that if the arrival of page faults
can be modeled by a Poisson process where $\lambda_i \,\Delta t$ is the
probability of a page fault during $\Delta t$ given that i distinct pages
have already been referenced, then the average time $t_p$ to reference p
pages is just

$$t_p = \sum_{i=1}^{p-1} \frac{1}{\lambda_i}, \qquad p > 1 .$$

Using an empirical curve for $t_p = f(p)$ (Figure 1) we can determine
$\lambda_p$. Since

$$\frac{1}{\lambda_p} = t_{p+1} - t_p = f(p+1) - f(p) \approx \frac{df(p)}{dp}\,\Delta p ,$$

we have

$$\frac{1}{\lambda_p} \approx \frac{df(p)}{dp}\,\Delta p . \tag{1}$$

Thus, we can determine the $\lambda_p$ probabilities by examining empirical
$t_p$ curves.
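
The relation between the mean reference times and the fault rates can be checked numerically. The following is a minimal sketch, not taken from the report: the function names and the sample curve are illustrative assumptions, and it assumes $t_1 = 0$ (the program starts with one page resident) and a strictly increasing $t_p$ curve.

```python
def rates_from_times(t):
    """Estimate lambda_p from an empirical t_p curve.

    t[k] is the mean time to reference k+1 distinct pages (t[0] = t_1).
    Returns lam where lam[k] estimates lambda_{k+1} = 1 / (t_{k+2} - t_{k+1}).
    """
    return [1.0 / (t[k + 1] - t[k]) for k in range(len(t) - 1)]


def times_from_rates(lam):
    """Rebuild t_p = sum_{i=1}^{p-1} 1/lambda_i, starting from t_1 = 0."""
    t = [0.0]
    for rate in lam:
        t.append(t[-1] + 1.0 / rate)
    return t


# Hypothetical measured curve (instruction counts), for illustration only.
t_empirical = [0.0, 10.0, 40.0, 120.0, 300.0]
lam = rates_from_times(t_empirical)
print(lam)
print(times_from_rates(lam))
```

Rebuilding the curve from the estimated rates should reproduce the input, which is a quick consistency check on the sum formula.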


We will model the $t_p$ function of a program with the formula

$$f(p) = \delta p^{\gamma} . \tag{2}$$

This formula has been applied to the f(p) data presented by Fine, et al.,
and it was determined that $\delta = 1.1$ and $\gamma = 3.4$. Using Eqs. (1)
and (2) where $\Delta p = 1$, we find

$$\frac{1}{\lambda_p} \approx \frac{df(p)}{dp} = \delta \gamma \, p^{\gamma - 1} \tag{3}$$

or $1/\lambda_p = 3.8\, p^{2.4}$.


Given we are in state p (the p most recently referenced pages in primary
memory), the probability of referencing a new page (page fault) by time t,
assuming a Poisson distribution, is given by
$p(t \mid p) = 1 - e^{-\lambda_p t}$. Now, if we assume that we force the
system to remain in state p by replacing the least recently used page with
the new page each time a page fault occurs, then we might expect the system
to continue to behave as before; i.e., the system will continue to generate
faults according to $1 - e^{-\lambda_p t}$. It can then be shown that the
mean time between page faults in state p is just

$$\varphi_\infty(p) = 1/\lambda_p = \alpha p^{\beta} \tag{4}$$

where we define $\varphi_\infty(p)$ to be the steady state average execution
burst given p pages in core.
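
To get a feel for how quickly the steady state burst grows with the memory allotment, the power law can be tabulated. This is a small sketch under the assumption that $\alpha = \delta\gamma$ and $\beta = \gamma - 1$, i.e. the derivative of $f(p) = \delta p^{\gamma}$ with the values fitted to Fine's data; the function name is hypothetical.

```python
def steady_state_burst(p, delta=1.1, gamma=3.4):
    """Mean instructions between faults with p pages resident.

    Uses alpha = delta * gamma and beta = gamma - 1, the derivative of
    f(p) = delta * p**gamma. The defaults are the fitted values quoted
    in the text.
    """
    alpha = delta * gamma      # 3.74, which the text rounds to 3.8
    beta = gamma - 1.0         # 2.4
    return alpha * p ** beta


for p in (1, 4, 16, 64):
    print(p, round(steady_state_burst(p)))
```

Doubling the allotment multiplies the burst by $2^{2.4}$, a little more than a factor of five, which is why the model is so sensitive to p.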

The average execution burst over time q, $\varphi_q(p)$, given that a
program starts with one page and is allowed a maximum of p pages, should be
derived using distributions of q (see Smith [132] and Freibergs [62]
for q distributions), but this is beyond the scope of this paper. We shall
settle for the approximations

$$\varphi_q(p) \approx \frac{q}{f^{-1}(q)}, \qquad q \le t_p \tag{5}$$

where $f^{-1}(q)$ is the average number of pages referenced in time
$q \le t_p$. In case $q > t_p$,

$$\varphi_q(p) \approx \frac{q}{p + \lambda_p (q - t_p)}, \qquad q > t_p \tag{6}$$

where $p + \lambda_p (q - t_p)$ is the total number of page faults generated
in time $q > t_p$. If $q \gg t_p$, then $\lambda_p q \gg p$ and we have

$$\varphi_q(p) \approx 1/\lambda_p, \qquad q \gg t_p \tag{7}$$

which is Eq. (4) as $q \to \infty$.
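
The two-branch approximation can be written out directly. The sketch below is an illustration under stated assumptions rather than the authors' program: it takes $f(p) = \delta p^{\gamma}$ so that $f^{-1}(q) = (q/\delta)^{1/\gamma}$, uses $1/\lambda_p = \delta\gamma p^{\gamma-1}$, and assumes $q \ge \delta$ so that at least one page is counted. The function name is hypothetical.

```python
def mean_burst(p, q, delta=1.1, gamma=3.4):
    """Approximate average execution burst over a quantum q.

    Below t_p the program is still acquiring pages, so the burst is q
    divided by the pages touched so far; above t_p the faults are the p
    initial loads plus steady state faults at rate lambda_p.
    """
    t_p = delta * p ** gamma                        # mean time to reach p pages
    if q <= t_p:
        pages_touched = (q / delta) ** (1.0 / gamma)    # f^{-1}(q)
        return q / pages_touched
    lam_p = 1.0 / (delta * gamma * p ** (gamma - 1.0))
    return q / (p + lam_p * (q - t_p))


for q in (5_000, 50_000, 5_000_000):
    print(q, round(mean_burst(12, q), 1))
```

The two branches agree at $q = t_p$ (both give $t_p / p$), and for very large q the result approaches $1/\lambda_p$.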


Each time a page fault occurs, we have to pay an average time
T to make space for and make present a page from secondary memory. Thus,
we can define the (monoprogrammed) CPU efficiency factor as

$$E = \frac{\varphi_q(p)}{\varphi_q(p) + T} \tag{8}$$

where T is measured in average instruction times. Figures 2a and 2b show
several E vs. (p,T) surfaces for $q = 2 \times 10^{?}$ [exponent illegible]
and $q = 5 \times 10^{3}$, where approximations (5) and (6) were used to
compute $\varphi_q(p)$, and $1/\lambda_p = 3.8\, p^{2.4}$.
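
As a rough illustration of how efficiency depends on p and T, the sketch below evaluates the ratio of useful time to total time, using the large-quantum burst $1/\lambda_p = 3.8\,p^{2.4}$ in place of the full approximations. The function names and the sample T values are assumptions made for this example, not figures from the report.

```python
def cpu_efficiency(phi, T):
    """Monoprogrammed CPU efficiency: useful time over total time.

    phi is the mean execution burst and T the mean time to fetch a page,
    both in average instruction times.
    """
    return phi / (phi + T)


def steady_burst(p, alpha=3.8, beta=2.4):
    """Large-quantum burst 1/lambda_p = alpha * p**beta."""
    return alpha * p ** beta


# T values are illustrative assumptions, not measurements from the report.
for T in (1_000, 10_000):
    row = [round(cpu_efficiency(steady_burst(p), T), 3) for p in (4, 16, 64)]
    print(T, row)
```

Efficiency is exactly one half when the burst equals T, so the interesting region of the surface is where $3.8\,p^{2.4}$ and T are of the same order.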

Looking at Figure 2a we notice that in the region of low p,
the only way to get higher efficiency is to use a fast secondary memory.
However, secondary memories with T < 1000 average instruction times would
correspond to extended core storage. In the region of larger T,
corresponding to drums and fast, head per track disks, the only way to
achieve reasonable monoprogrammed efficiency is by providing sufficient
primary memory.

Figure 2b shows the effect of a smaller time quantum. In this
case, efficiency is sensitive to T and insensitive to p over almost the
entire surface. This is due to the fact that programs corresponding to
$\alpha = 3.8$ and $\beta = 2.4$ seldom reference more than about 12 pages
within the quantum time q = 5000. While this surface was computed using a
constant q instead of a statistical distribution of q, it still indicates
what can happen to individual program efficiency when programs are swapped
out of primary memory for a (non-page fault) I/O interrupt or a small system
imposed time quantum. The actual degradation will, of course, depend on
the characteristics of the program $(\alpha, \beta)$ as well as the system's
ability to mask I/O with multiprogramming techniques.

In this section we have presented a very simple model of program
paging behavior in terms of the average time required to reference p pages

$$t_p = \delta p^{\gamma} .$$

Then, under the assumption that paging is a Poisson process, we derived
the average execution burst as a function of the number of pages in primary
memory

$$\varphi_\infty(p) = \frac{d t_p}{dp} = \alpha p^{\beta} .$$

Using these relations and values of $\lambda$, $\alpha$ and $\beta$ derived
from Fine's results, we showed the effect on monoprogrammed efficiency of a
gross time characteristic T of secondary memory, primary memory allotment,
and time quantum q. This was done under the assumption that the page size
was 1024 words and that a least-recently-used page replacement algorithm was
used. In the following sections, we will examine the effects of different
page sizes, replacement algorithms, and the use of multiprogramming to mask
I/O time.

2.2  EFFECT  OF  PAGE  SIZE  AND  PRIMARY  MEMORY  ALLOTMENT  ON  PAGE  FAULT  RATE 

In the previous section we assumed that the page size was fixed
at 1024 words. As we shall see in this section, the page size, b, will
affect the page fault rate $\lambda$ for two reasons. First, primary memory
may be underutilized to some extent due to a) primary memory not being
filled with potentially useful words, i.e., fragmentation, and b) the
presence of words which are potentially useful but which are not referenced
during a period when the page is occupying primary memory, i.e.,
superfluity. Any underutilization of primary memory tends to increase the
page rate, since the effective memory allotment is decreased, as analyzed in
the last section. Second, more page faults may be generated when the page
size is b than when the page size is 2b, simply because we only have to
generate one page fault to reference all words in the 2b page, whereas to
reference the same words we have to generate two faults if the page size
is b.

2.2.1  FRAGMENTATIONS  AND  PAGE  SIZE 

We assume that a program consists of a number of segments of
size s, where s varies according to some statistical distribution with
mean $\bar{s}$. These segments may contain instructions or data or both.
The words of a segment are logically contiguous, but need not be stored in a
physically contiguous way. Each segment is further divided into a number
of pages. The pages consist of b words which are stored in a physically
contiguous way. To allow for variable page size, we assume the system
imposes a size quantum $Q \le B$ on all storage requests such that requests
are always rounded up to the next multiple of Q. Page size b may be any
multiple of Q, but may not exceed B, which is the largest number of
necessarily physically contiguous words which the system can handle. The
ratio B/Q may be thought of as an index of the variability of the page size.
All pages of a segment will be of size b = B except the last, which will be
some multiple n of Q, $b = nQ \le B$. The physical base address of a page
may be any multiple of Q; that is, it may be loaded beginning at any address
which is a multiple of Q. For example, if the maximum segment size
$s_{\max} = B = 1024$ and Q = 1, then we have the case corresponding to the
Burroughs B5500.¹¹ If Q = B and $s_{\max} \gg B$, then we have the case of
more conventional paging systems.

Thus, we might have several pages allocated in primary memory
as shown in Figure 3, where Q = B/4. Notice that there are two sources
of memory waste evident in Figure 3. First, memory is wasted because
every storage request must be rounded up to a multiple of Q, as shown by
the wavy lines. We refer to this as internal fragmentation. Second,
memory is wasted because there are four blocks of Q words which cannot
be used to hold a full page because they are not contiguous. This is the
classical situation of checkerboarding, which we will refer to as external
fragmentation. Notice that as $Q \to 1$, internal fragmentation diminishes
to zero, while as $Q \to B$, external fragmentation disappears. The exact
amount of waste will be dependent on Q, B, and the distribution of segment
sizes.
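
The internal part of this waste is easy to compute for a given set of segments. The sketch below is an illustration only: the segment sizes are made up, the function names are hypothetical, and external fragmentation is not modeled because it depends on where pages are placed over time.

```python
def page_sizes(s, B, Q):
    """Split a segment of s words into pages: full pages of B words,
    then a last page rounded up to the next multiple of the quantum Q."""
    if not (0 < Q <= B and B % Q == 0):
        raise ValueError("need 0 < Q <= B with B a multiple of Q")
    full, rest = divmod(s, B)
    sizes = [B] * full
    if rest:
        sizes.append(((rest + Q - 1) // Q) * Q)   # round up to a multiple of Q
    return sizes


def internal_waste(segments, B, Q):
    """Words allocated but not occupied, summed over all segments."""
    return sum(sum(page_sizes(s, B, Q)) - s for s in segments)


segments = [300, 1024, 1500, 2600]   # hypothetical segment sizes in words
for Q in (1, 256, 1024):
    print(Q, internal_waste(segments, B=1024, Q=Q))
```

With Q = 1 nothing is lost to rounding, while with Q = B every segment's last page is padded to a full B words.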

Randell [113] has studied the effects on memory utilization
of variations in these parameters. His results indicate that: 1) loss
of utilization due to external fragmentation when $Q \ll B$ is not as great
as loss due to internal fragmentation when Q = B; and 2) utilization does
not change significantly with changes in the mean segment size if
$Q \ll B$, but it does change significantly with $\bar{s}$ if Q = B. It is
also apparent that if $\bar{s} \gg B$, then Q makes little difference.

The conclusion from this is that if a program is to be segmented
where $\bar{s} = B$, then small Q is definitely desirable. If the page size
must be fixed (Q = B), then it should be small. However, the increase in
memory utilization due to smaller B must be offset by the possible cost
increase for the required paging hardware. For instance, if we can double
memory utilization by decreasing the block size by some factor, then we can
afford to spend 1/2 the total cost of primary memory on the increased
paging hardware.

Figure 3. Memory fragmentation with four pages of
size $b_1 = 4Q$, $b_2 = 1.5Q$, $b_3 = 3.2Q$ and
$b_4 = 4Q$. $B = 4Q$.

Unfortunately, a small B or Q is not the entire answer. While
small B or Q increases memory utilization and thus reduces the page rate
for a given memory allotment, small B or Q may also result in a
corresponding increase in page rate for reasons we will discuss in 2.2.3.

2.2.2  SUPERFLUITY  VS.  PAGE  SIZE 

Another factor which leads to an effective underutilization
of primary memory arises from instruction or data words which are loaded
into primary memory as part of a page but are never referenced during
that period of residency. We will refer to these words as superfluous.

We can obtain a lower bound on the number of superfluous words
by examining the total primary memory requirements M of a program as a
function of page size.¹² That is, assume primary memory is unlimited;
then M(B) is the total amount of primary memory occupied after a given
execution of the program with page size B. Now, given unlimited primary
memory, if the program is run with page sizes b = B and b = 1, then at
least M(B) - M(1) words must be superfluous. If we force the program to
run with primary memory m < M(B), then page faulting will occur and the
number of superfluous words may increase over M(B) - M(1), since some words
which are eventually referenced are not referenced during some period of
their page residency and are thus superfluous during that period.

O'Neill [108]¹³ and Belady [11]¹⁴ present M(B) statistics which
are remarkably linear over the ranges $256 \le B \le 2048$ and
$128 \le B \le 1024$, respectively. Even for larger page sizes M(B) is
reasonably linear, but for small B, M(B) drops off sharply. Thus, we can
assume

$$M(B) = a_0 + a_1 B, \qquad 256 \le B \le 1024 \tag{9}$$

and $a_1 B$ is a lower bound on the number of superfluous words.
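
If M(B) measurements were available, the coefficients of the linear model could be estimated by ordinary least squares. The following sketch uses invented numbers purely to show the calculation; they are not data from O'Neill or Belady, and the function name is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = a0 + a1 * x; returns (a0, a1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a1 = sxy / sxx
    return mean_y - a1 * mean_x, a1


# Hypothetical M(B) measurements in words, NOT data from the cited studies.
page_size = [256, 512, 1024]
memory_used = [21_000, 23_000, 27_000]
a0, a1 = fit_line(page_size, memory_used)
print(a0, a1)
print("lower bound on superfluous words at B = 1024:", a1 * 1024)
```

The slope $a_1$ times the page size then gives the lower bound on superfluous words described above.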

Unfortunately, Eq. (9) only establishes a lower bound on the
number of superfluous words. It does not tell us anything about the
average number of superfluous words present when primary memory is less than
that absolutely required by the program. The authors know of no published
data which pertain directly to superfluous words in this case, so we shall
move on to determine the overall effect of block size on the paging rate
$\lambda$.

2.2.3 PRIMARY MEMORY ALLOTMENT AND PAGE SIZE

In Section 2.1 we discussed the average execution burst $\varphi(p)$ as
a function of memory allotment in units of p, the number of b = B = 1024
word pages. In this section we will examine the paging rate
$\lambda = 1/\varphi$ as a function of primary memory allotment in words,
m = pB, for various values of page size b = B.

We would expect that for small m, $\lambda$ will vary considerably with
the page size. This is because for small m, the average time each page
is in primary memory will be relatively short, and so the extra words in
larger pages will tend to go unreferenced and will only take up space
which might better be occupied by new, smaller pages. On the other hand,
as m increases, we would expect to see page size have less effect, since
the probability will be higher that more words in the page will be
referenced due to the longer expected page residence time. In addition, we
might also expect to see, for a given m, an optimal page size such that any
larger page size will only include superfluous words and any smaller page
size will not include enough words.

Figure 4a is a graph of $\lambda$ vs. B and m based on experimental
data from a FORTRAN compiler [4]. This graph clearly exhibits that
when a program is "compressed," i.e., run in a smaller memory, large page
sizes lead to excessive paging. When the page size is small, the
program tends to be more compressible. As m gets larger, the paging
behavior becomes less a function of B, and for large enough m, small B
may even increase the page rate. Slight minimum points were observed at
the (m,B) points (2K, 64), (4K, 256), (8K, 256). This illustrates that if
minima exist, then they are not necessarily independent of m.

Figure 4b is another graph of $\lambda$ vs. m and B data for a SNOBOL
compiler [135]. This program is evidently much less "compressible" than
the FORTRAN compiler in Figure 4a. However, it shows the same general
tendencies as Figure 4a except for the apparent lack of minima.

Another way to view the $\lambda$ vs. (m,B) relationship can be seen
by observing in Figure 4b the dashed lines which pass through points of
equal $\lambda$. Notice that $\lambda$(8K, 256) is only slightly lower than
$\lambda$(4K, 64). Thus, we can effect an almost equal tradeoff between
half as much primary memory and 1/4 the page size; i.e., we double the
number of pages but each page is only 1/4 as large. However, we must also
consider the increased paging hardware necessary to handle the larger
number of pages.¹⁷

The main point to be had from these figures is that programs are
more compressible when B is small; i.e., they will tolerate a much smaller
primary memory allotment if B is small. However, too small a B may lead
to a slight increase in paging activity. (See also a study performed on
the ATLAS system by Baylis, et al. [9].)

The above results further support arguments for variable page
sizes allowing logically dependent words (e.g., subroutines or array rows)
to be grouped in a page without leading to underutilization of memory
due to internal fragmentation or superfluity. Logical segmentation of
code and data will be taken up more generally in later sections.

[Figure 4a. Page fault rate $\lambda$ vs. page size B and primary memory
allotment m, FORTRAN compiler.]

[Figure 4b. Page fault rate $\lambda$ vs. m and B, SNOBOL compiler.]

2.3  REPLACEMENT  ALGORITHMS 

Whenever it is necessary to pull a new page, i.e., transfer
a new page from secondary to primary memory, it is also necessary to
select a replacement page in primary memory to be pushed (transferred
to secondary memory) or overlayed. If we assume that all programs are
in the form of pure procedures, then we never need to push program pages.
Data pages need to be pushed only if we have written into them. The
selection of a replacement page is done by a replacement algorithm. A
number of these algorithms have been proposed and evaluated [9, 11,
12, 17, 18, 27, 40, 41, 86, 116, 125, 135], where Belady [11] has
produced the most extensive summary and evaluation to date. The various
algorithms can be classified according to the type of data which is used
by the replacement algorithm in choosing the replacement page.

Type 1) The first type of information pertains to the length
of time each page has been in primary memory. The page (or
class of pages) which has been in memory the longest is
pushed or overlayed first. This information forms the basis
of what are usually referred to as FIFO algorithms. This is
the simplest type of information to maintain and it usually
requires no special hardware to implement.

Type 2) Type 2 information is similar to Type 1 information
but "age" is measured by the time since the last reference
to a page rather than how long the page has been in primary
memory. This information is the basis of the so-called
least-recently-used replacement algorithms. Many variations
exist, e.g., based on the fineness of age measurement. Systems
which accumulate this type of information usually employ
some type of special hardware to record page use statistics.

Type 3) Information as to whether or not the contents of
a page have been changed is frequently used to bias the
selection towards pages which have not been changed and
thus do not have to be pushed (but simply overlayed) since
an exact copy is still available in secondary memory.
Special hardware is needed to record the read-only/write
status of each page in primary memory.

Type 4) In the ATLAS system [9, 86] the length of the
last period of inactivity is recorded for all pages in a
program. This information is used to predict how long the
current period of inactivity will be, i.e., how soon a page
will be referenced again. Replacement is biased towards
pages which, on the basis of this information, are expected
to be inactive for the longest time. This type of information
is particularly useful for detecting program loops, as was
intended by the ATLAS designers.

Belady  [  11  ]  has  evaluated  the  performance  in  terms  of  page  fault  rate  of 
a  nu".ber  of  algoritnr.s  as  functions  of  page  size  and  primary  memory  allot- 
nent,  anu  we  will  now  discuss  his  results. 

The simplest algorithm studied was the RANDOM algorithm. This
uses no information about pages, but chooses a replacement page randomly
from those in primary memory. The use of Type 1 information (time in
primary memory) never significantly improves performance relative to
RANDOM and in some cases performance is worse than RANDOM.

The  use  of  Type  2  information  (time  since  last  read  or  write) 
leads  to  the  most  significant  and  consistent  improvement  in  performance. 
With  these  algorithms  the  accuracy  with  which  "age"  is  measured  does  not 
seem  to  have  much  effect  on  performance,  however.  That  is,  performance 
does  not  change  significantly  whether  we  keep  a  complete  time  history  of 
pages  in  primary  memory,  or  just  divide  all  pages  into  two  classes— 
recently used and not-so-recently used. The use of Type 3 information
(read-only/write status) in addition to Type 2 information does not affect
the total number of page faults very much. However, it does increase
performance because no push is required on 10 to 60% of all page
faults.

The ATLAS algorithm [86], which used both Type 2 and Type 4 information,
is the most complex algorithm studied, and it is interesting to note
that it consistently leads to worse results than Type 2 algorithms and is
sometimes worse than RANDOM or FIFO. This result has been further
substantiated by Baylis et al. [9]. Apparently, the problem is that most
programs do not have a regular or small enough loop structure to warrant
the use of the ATLAS algorithm, which is intended to take advantage of
program loops.
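
The qualitative comparison above can be explored with a small simulation.
The following Python sketch counts page faults under RANDOM, FIFO (Type 1
information) and least-recently-used (Type 2 information) replacement. It
is an illustration only: the function name `count_faults`, the reference
string and the frame counts are hypothetical and are not taken from
Belady's experiments [11].

```python
import random
from collections import OrderedDict


def count_faults(refs, frames, policy, seed=0):
    """Count page faults for a reference string under one replacement policy.

    policy is "RANDOM", "FIFO" (evict the page resident longest), or
    "LRU" (evict the page referenced least recently).
    """
    if policy not in ("RANDOM", "FIFO", "LRU"):
        raise ValueError("unknown policy: " + policy)
    rng = random.Random(seed)      # seeded so RANDOM runs are repeatable
    resident = OrderedDict()       # resident pages, oldest entry first
    faults = 0
    for page in refs:
        if page in resident:
            if policy == "LRU":
                resident.move_to_end(page)   # record the new last-reference time
            continue
        faults += 1
        if len(resident) >= frames:
            if policy == "RANDOM":
                victim = rng.choice(list(resident))
                del resident[victim]
            else:
                resident.popitem(last=False)  # front is oldest (FIFO) or LRU page
        resident[page] = None
    return faults


# Hypothetical reference string, chosen only for illustration.
refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
for policy in ("RANDOM", "FIFO", "LRU"):
    print(policy, [count_faults(refs, m, policy) for m in (3, 4)])
```

For this particular string FIFO takes 9 faults with three page frames and
10 with four, while least-recently-used takes 10 and 8. A string this short
shows only that the policies can differ; it says nothing about which is
better for real programs.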

Thus, algorithms which make replacements on the basis of least
recently referenced pages and bias towards read-only pages would seem to
be  best  in  terms  of  cost  effectiveness.  However,  for  existing  systems 
which  do  not  have  the  hardware  necessary  to  automatically  maintain  Type  2 
and/or  Type  3  information,  RANDOM,  FIFO  or  programmer  directed  schemes 
must be used.

2.4 PROGRAM ORGANIZATION

Comeau [30] has shown that simply by reordering the assembler
deck of the Cambridge Monitor System to cause logically dependent routines
to be grouped together, paging of the monitor was reduced by as much as
80%. Brawn and Gustavson [18] and McKellar and Coffman [103] have
shown that simple changes in computation algorithms, such as referencing
matrices by square partition instead of row or column, can also effect
large improvements in paging activity. (See also [36, 37, 51, 73].)
These studies indicate that:

1)  Programmers  need  to  be  aware  of  the  paged  and/or  segmented 
environment  in  which  their  programs  will  be  executed. 
Program  optimization  by  reducing  page  faults  is  more 
important  than  classical  optimization  techniques  (e.g., 
common  subexpression  elimination). 

2) Programmers should be able to direct or advise the compiler
as to which code should be placed in which page/segment.

3)  If  possible,  subroutine  or  procedure  code  should  be  placed 
in  the  code  segment  where  it  is  called.  If  this  code  is 
small and is used in several different segments, then
several  copies  of  the  subroutine  could  be  generated,  one 
in  each  segment  where  it  is  called. 

4) More emphasis should be placed on compiler optimization
of code through strategic segmentation. For example, by
analyzing the structure of a program (see Martin and Estrin
[99]) the compiler could make better segmentation decisions
and provide information which the operating system could
use to make replacement decisions, and to perform prepaging.
In addition, compilers might be able to detect certain
cases of poor data referencing patterns and issue appropriate
warnings to the programmer.

Thus, we can improve paging behavior both by changing the physical
parameters of the system and by intelligent program organization. The
latter method would appear to have a higher cost effectiveness and should
not be overlooked.
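
The effect of referencing matrices by square partition can be illustrated
with a count of the pages touched when one row and one column of a matrix
are read. The Python sketch below compares a matrix stored by rows with one
stored by square partitions, one partition per page. The matrix size, the
page size and the function names are assumptions made for illustration and
do not come from the studies cited above.

```python
def page_row_major(i, j, n, page_words):
    """Page holding element (i, j) when the n x n matrix is stored by rows."""
    return (i * n + j) // page_words


def page_square(i, j, n, side):
    """Page holding element (i, j) when each page is one side x side partition."""
    blocks_per_row = n // side
    return (i // side) * blocks_per_row + (j // side)


def pages_touched(cells, page_of):
    """Number of distinct pages referenced while visiting the given cells."""
    return len({page_of(i, j) for (i, j) in cells})


n, side = 64, 8                 # hypothetical sizes; side must divide n
page_words = side * side        # both layouts use 64-word pages
row = [(0, j) for j in range(n)]
col = [(i, 0) for i in range(n)]

by_rows = lambda i, j: page_row_major(i, j, n, page_words)
by_squares = lambda i, j: page_square(i, j, n, side)

print("stored by rows:   ", pages_touched(row, by_rows), pages_touched(col, by_rows))
print("stored by squares:", pages_touched(row, by_squares), pages_touched(col, by_squares))
```

With these sizes a row costs 1 page and a column 64 pages when the matrix
is stored by rows, against 8 pages each when it is stored by square
partitions, so a computation that needs both rows and columns references
far fewer distinct pages in the partitioned layout.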

2.5   SUMMARY 

As  we  have  noted,  CPU  efficiency  can  be  related  to  the  page 
fault  rate  and  the  average  time  T  to  satisfy  these  I/O  requests.   In 
Section  II  we  have  tried  to  illustrate  the  relationships  between  page 
fault  rate  and  primary  memory  size,  primary  memory  allotment,  page  size, 
replacement  algorithm,  program  organization,  and  secondary  memory 
characteristics.   Our  intent  has  only  been  to  indicate  trends  and  general 
relationships,  and  with  this  in  mind  our  models  have  not  been  very  elaborate. 
However,  all  our  models  have  been  based  on  observed  program  behavior  and 
are  probably  accurate,  at  least  for  the  classes  of  programs  studied. 

III.   Multiprogramming 

Multiprogramming arises for two reasons:

1) In an attempt to overlap I/O time by having one program
be executed while other programs are waiting for I/O
(implicit or explicit).

2) In order to provide quick response to several real
time jobs (time sharing, process controls, etc.).

We will concern ourselves only with the first of these functions.
Whenever several concurrent programs share memory
in order to "mask" I/O time, each program operates with less primary
memory than it would have if it were running alone. As we have seen,
this causes the paging rate for each program to increase. On the other
hand, by multiprogramming we are able to decrease the average time per
I/O request (both paging and explicit). Several questions now arise:
First, when does the degradation of efficiency due to increased page
traffic become greater than the increase in efficiency due to more I/O
masking? Second, how much of an improvement can we expect with
multiprogramming over monoprogramming?

Gaver [65] has presented an analysis of multiprogramming based
on a probability model which relates CPU efficiency to the number of
concurrent jobs J, where each job runs for an average of $1/\lambda$ instructions
(hyperexponentially distributed) before generating an I/O interrupt, and
I/O requires an average of T instruction times to complete (exponentially
distributed). Unfortunately, Gaver does not consider the fact that as
J increases, each job must be executed with less primary memory and thus
paging I/O increases. However, this is fairly easy to add to his model,
using the results of Section 2.1.

Suppose the total available primary memory is M pages (see footnote 20)
and all programs are identical and are allocated equal amounts of this
memory. Then the memory allotment for each program is just M/J (see
footnote 21). The paging rate $\lambda$ for each program as a function of J
is then

$$\lambda(J) = \frac{1}{\varphi(M/J)} \tag{10}$$

where $\varphi(p)$ was defined in Section 2.1. We will assume this is
exponentially distributed. As in Section 2.1 we will use the function
$\alpha (M/J)^{\beta}$ to model $\varphi(M/J)$. Thus

$$\lambda(J) = \frac{1}{\alpha (M/J)^{\beta}} \tag{11}$$

We also assume that explicit (non-page) I/O interrupts are generated at
a rate r, so the total I/O interrupt rate for each program is

$$\lambda(J) = r + \frac{1}{\alpha (M/J)^{\beta}} \tag{12}$$
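
The dependence of the interrupt rate on J can be tabulated directly. The
Python sketch below assumes that Eq. (12) has the form
$\lambda(J) = r + 1/(\alpha (M/J)^{\beta})$, i.e., that $\varphi(p) = \alpha p^{\beta}$
is the mean number of instructions between page faults for an allotment of
p pages. The function names are hypothetical; the values $\alpha = 3.8$,
$\beta = 2.4$, M = 64 and 1/r = 10,000 are those used in this section.
Gaver's efficiency equations are not reproduced here.

```python
def paging_rate(J, M, alpha=3.8, beta=2.4):
    """Page faults per instruction for one of J identical jobs sharing M pages.

    Assumes phi(p) = alpha * p**beta is the mean number of instructions
    between page faults when a job is allotted p pages.
    """
    if J < 1 or M <= 0:
        raise ValueError("need J >= 1 and M > 0")
    return 1.0 / (alpha * (M / J) ** beta)


def io_interrupt_rate(J, M, r, alpha=3.8, beta=2.4):
    """Total I/O interrupt rate per job: explicit rate r plus the paging rate."""
    return r + paging_rate(J, M, alpha, beta)


M, r = 64, 1.0 / 10_000         # M = 64 pages, 1/r = 10,000 instructions
for J in (1, 2, 4, 8):
    rate = io_interrupt_rate(J, M, r)
    print(f"J={J}: one I/O interrupt every {1.0 / rate:.0f} instructions")
```

Under this assumed form, each doubling of J multiplies the paging term by
$2^{\beta}$, which is why paging eventually dominates the explicit I/O rate
as more jobs share the same memory.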


Using Eq. (12) and Gaver's equations, we have computed CPU
efficiency as a function of the number of identical jobs J and average
I/O completion time T for several values of r and M and for $\alpha = 3.8$ and
$\beta = 2.4$ (see Section 2.1, Eq. 3). The results of these computations are
plotted in Figures 5a, 5b, and 5c.

Figure 5a shows CPU efficiency for M = 64 and 1/r = 10,000
(e.g., primary memory is 64K words and each program generates an explicit
I/O interrupt every 10,000 instructions on the average). Notice that
efficiency increases with J up to an optimal value at $J_{opt}$ due to
interleaving I/O time. Thereafter, efficiency decreases because of
increased paging. Figure 5b corresponds to M = 64 and 1/r = 5000.
Efficiency is less than that in Figure 5a due to the increased explicit
I/O, and the gain in efficiency for monoprogrammed vs. multiprogrammed,
$E(J_{opt}) - E(1)$, is more pronounced. Figure 5c corresponds to M = 32
and 1/r = 10,000. Efficiency is again smaller than in Figure 5a, this time
due to increased paging induced by smaller M. Notice that when T > 6000,
there is no gain to be had from multiprogramming. This does not mean that
multiprogramming with this system configuration is bad. It merely
illustrates that for this system it is not wise to multiprogram programs
characterized by $\alpha = 3.8$, $\beta = 2.4$ and 1/r = 10,000. (If
1/r = 5000, then running 2 jobs is advantageous; see Figure 6.)

This introduces the scheduling problem. That is, which jobs
should be run concurrently? A good scheduler whose purpose is to maximize
throughput should be able to use information about programs' working sets
or $\alpha$, $\beta$ characteristics to determine an optimal load. We will not
pursue this subject further here (see Denning [40, 41] and Heller [70]).

Figure 6 shows the relative gain in efficiency over monoprogramming
due to multiprogramming with an optimal number of jobs

$$G = \frac{E(J_{opt}) - E(1)}{E(1)} \tag{13}$$

as a function of T for several combinations of r and M (in all cases,
$\alpha = 3.8$, $\beta = 2.4$). This figure illustrates that for
multiprogramming to yield a reasonable gain, there must be sufficient
primary memory (note the M = 32 curves).

Literature on multiprogramming and time-sharing is extensive
and we will not attempt to present a comprehensive bibliography here
(instead, see Buchholz [20], Calingaert [22], McKinney [104], Trimble
[134] and Bell and Pirtle [14]). Some useful studies can be found in
[12, 49, 52, 56, 65, 107, 130, 131, 132, 136].

IV. Average Time Per I/O Request

In  Section  II  we  introduced  T  as  the  average  interval  between 
the  time  when  a  program  is  forced  to  stop  (due  to  a  lack  of  instructions 
or  data  in  primary  memory)  and  the  time  when  the  program  could  resume. 
In  2.1  and  III,  we  showed  that  CPU  efficiency  is  highly  correlated  with 
the magnitude of T (see Figures 2 and 5). In the following sections we
will  examine  T  in  more  detail.  Specifically,  we  will  discuss  techniques 
whereby  T  can  be  reduced. 

Secondary  storage  devices  range  from  extended  core  storage  to 
magnetic  tape,  but  the  most  common  device  in  use  today  is  the  disk  file. 
The time required for these devices to deliver a block of b words can be
generally characterized by

$$T = t_q + t_a + b/p \tag{14}$$

where $t_q$ is queueing time before the disk logic recognizes an I/O request;
$t_a$ is the sum of head positioning latency and rotational latency; and p
is the transmission rate between primary and secondary memory. Four ways
in which we can decrease the average T are:

1) Decrease $t_a$ by making the disk spin faster, using more
heads per surface, or by using extended core storage.

2) Making the disk spin faster or using higher bit densities
increases p. We might also increase p directly
by reading more heads simultaneously.

3) Use parallel queueing techniques so that the average
T over n requests is less than T.

4) Change the distribution of $t_a$ by planning the layout
of data on the disk in such a way that the data is
almost under the read heads when it is needed (this
technique is only practical in systems doing large
calculations where a dedicated disk is available).
Alternately, we can prefetch data blocks (buffering).

We will now discuss some of these techniques.
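
The relative effect of the first two options can be seen by evaluating
Eq. (14) for a few parameter values. The Python sketch below does this for
a hypothetical device; the function name and all numerical values are
assumptions chosen only for illustration (times in milliseconds, p in words
per millisecond).

```python
def service_time(t_q, t_a, b, p):
    """Eq. (14): time to deliver a block of b words, T = t_q + t_a + b/p."""
    if p <= 0:
        raise ValueError("transmission rate p must be positive")
    return t_q + t_a + b / p


# Hypothetical device: 5 ms queueing, 30 ms access, 1024-word blocks.
baseline = service_time(t_q=5.0, t_a=30.0, b=1024, p=100.0)
faster_access = service_time(t_q=5.0, t_a=15.0, b=1024, p=100.0)     # option 1
faster_transfer = service_time(t_q=5.0, t_a=30.0, b=1024, p=200.0)   # option 2

print(f"baseline        T = {baseline:.2f} ms")
print(f"halved t_a      T = {faster_access:.2f} ms")
print(f"doubled p       T = {faster_transfer:.2f} ms")
```

For these assumed values access time dominates, so halving $t_a$ helps more
than doubling p; with larger blocks or a slower channel the balance would
shift towards the transmission term.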

4.1 PHYSICAL LATENCY OF SECONDARY MEMORY

Consider a disk system with one movable head per surface and
with all heads fixed to the same head positioner assembly. Now $t_a$, the
access time for this device, is the sum of two statistically distributed
times: $t_p$, the time to position a head, and $t_\ell$, the time required for
the desired sector to come under the heads (see footnote 22):

$$t_a = t_p + t_\ell \tag{15}$$

One way to make this disk faster is to add more heads to each arm so that
the arm does not have to move so far to position a head over the right
track. This tends to decrease $t_p$.

Another way to decrease $t_p$ would be to have independent
positioners for each surface. Fife and Smith [54] have presented a good
analysis of this technique. Several manufacturers have eliminated $t_p$
altogether by providing one fixed head per track. To provide further
speedup we could introduce multiple heads per track (a matter which
presents technological difficulties) or use a drum, which typically rotates
faster than a disk but does not have as large a capacity. Both of these
latter techniques reduce $t_\ell$ in Eq. (15). (See also [133].)

Any further improvement in the physical response of secondary
memory probably must come from the use of extended core storage (ECS).
This is potentially quite expensive (the cost per word being typically
more than one-tenth that of primary memory) but is considerably faster,
as latency is on the order of ten microseconds as opposed to tens of
milliseconds for disks and drums. This could double CPU efficiency
(see Figures 2 and 5) but must be evaluated on the basis of cost
effectiveness. Several studies of the use of ECS can be found in [7, 63,
&S  79,  83,  101]. 

4.2 EFFECTIVE LATENCY OF SECONDARY MEMORY

Several  techniques  can  be  used  to  decrease  the  effective  latency 
of  a  disk  device  without  changing  its  physical  characteristics.  For 
instance,  if  several  requests  for  blocks  from  the  disk  are  waiting  for 
service,  then  we  can  decrease  the  average  latency  over  all  requests  by 
servicing  requests  in  the  order  in  which  the  required  blocks  come  under 
the  heads.  Another  possibility  which  can  be  used  in  certain  special  cases 
is  to  coordinate  the  layout  of  blocks  on  the  disk  with  the  timing  of  the 
program  so  that  blocks  will  be  almost  under  the  heads  when  they  are  needed. 

4.3 REQUEST QUEUEING (see footnote 23)

We  will  assume  that  at  any  given  time,  there  are  n  requests  for 
service  from  secondary  memory  (these  requests  having  been  generated  by 
the  several  programs  being  multiprogrammed).  We  also  assume  that  the 
secondary  memory  is  a  rotating  device  divided  into  M  tracks,  each  track 
being  further  divided  into  N  sectors.  Each  request  is  for  access  to  a 
particular track and sector. The rotation time of the device is $T_r$.
Each request waiting for service will experience a delay T, the
sum of $t_q$ (time in queue), $t_a$ (access time), and $t_r$ (transmission time,
assumed constant).

The  simplest  way  to  service  these  requests  is  to  establish  a 
single  queue  which  is  serviced  on  a  first  in,  first  out  (FIFO)  basis. 
A  better  strategy  is  to  service  requests  according  to  which  request  can 
be  serviced  next  (FSFO),  i.e.,  the  request  whose  required  track  and  sector 
is due under the heads next is serviced first. Denning [39] shows that
for a fixed head per track device the ratio of delay time under FIFO to
delay time under FSFO is

$$\frac{T(\mathrm{FIFO})}{T(\mathrm{FSFO})} = \frac{n(N + 2)}{N + 2(n + 1)} \tag{16}$$

For N = 64 sectors and n = 10 requests the relative improvement by
Eq. (16) is 7.66. That is, the response of a fixed head device with 64
sectors and 10 waiting requests is 7.66 times better under FSFO than under
FIFO (see footnote 24). An analysis of movable head devices shows that
improvement can also be effected by similar scheduling algorithms, but the
improvement is not as dramatic.
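
The growth of this ratio with queue length can be checked numerically. The
Python sketch below evaluates the expression of Eq. (16), taken here to be
n(N + 2) / (N + 2(n + 1)); the function name is hypothetical and the
values of n are chosen only for illustration.

```python
def fifo_to_fsfo_ratio(n, N):
    """Eq. (16): ratio of delay under FIFO to delay under FSFO.

    n is the number of waiting requests, N the number of sectors per track.
    """
    if n < 1 or N < 1:
        raise ValueError("n and N must be positive")
    return n * (N + 2) / (N + 2 * (n + 1))


N = 64
for n in (1, 2, 5, 10, 20):
    print(f"n={n:2d}: FIFO/FSFO delay ratio = {fifo_to_fsfo_ratio(n, N):.2f}")
```

With N = 64 and n = 10 the expression evaluates to 660/86, about 7.67, in
line with the figure quoted in the text. The ratio rises with n, so the
advantage of FSFO is greatest when the request queue is long.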


4.4 MINIMIZATION OF EXPLICIT I/O REQUEST TIME

A  number  of  large  scale  calculations  require  space  for  their 
data  and  instructions  which  exceeds  the  available  primary  memory.  These 
calculations  involve  operations  on  very  large  arrays  and  may  require 
several  tens  of  hours  per  production  run  on  the  fastest  computers.   In 
such  cases  there  is  no  point  to  interval  time  slicing  of  the  computation 
for user interaction, although system throughput can be enhanced by
multiprogramming, as discussed in Section III. If we restrict our attention
only to these kinds of large jobs, then one limiting case is a large
machine  with  one  large  job  at  a  time,  i.e.,  batch  processing.  We  will 
now  turn  to  a  discussion  of  preplanning  the  layout  of  a  secondary  storage 
device  in  such  a  way  that  explicit  I/O  request  time  is  minimized.  The 
interleaving  of  several  jobs  will  not  be  discussed  except  to  remark  that 
in  such  cases  the  execution  time  requirements  become  less  stringent  for 
each  job,  but  the  sequencing  of  the  interleaved  steps  presents  new 
difficulties. 

Historically,  there  are  many  examples  of  preplanned  drum  layout. 
When  drums  were  used  as  primary  memory,  optimizing  assemblers  would  locate 
the  sequence  of  instructions  at  appropriate  intervals  around  the  drum  so 
that  (in  jump  free  segments  of  code)  the  next  instruction  would  always  be 
available when the previous one was finished [118]. For current machines
in monoprogramming mode, it is reasonable to assume that enough code resides
in primary memory at any time so that the time required to perform
instruction overlays is negligible. However, data overlays may be extensive and
we might be able to decrease the latency involved in obtaining data blocks
from secondary memory by planning the layout of these data blocks and
prefetching data.

The  question  of  overlaying  data  must  be  considered  with  respect 
to  the  average  amount  of  processing  which  may  be  performed  on  each  data 
element. Many matrix calculations (e.g., multiplication, inversion,
eigenvalue calculation) require $\alpha N^3$ operations where $\alpha < 1$ and N is the
dimension of the matrix. Also, it can be empirically observed that a number of
partial  differential  equation  solution  techniques  on  N  x  N  meshes  require 
a  I   operations  per  iteration,  where  cc   is  generally  smaller  than  in  the 
matrix case but usually greater than 0.1. In the partial differential
equation case it is sometimes possible to iterate several times on a
block in memory, thus increasing $\alpha$. If we assume $\alpha N^3$ operations on $N^2$
data elements, then each element requires $\alpha N$ operations, where an
operation may be regarded as, e.g., a multiply, an add, and a memory fetch or,
say, one microsecond on a current machine. Let us assume a machine with
$N^2$ words of memory available for each block transmitted from the disk.
This allows $\alpha N^3$ microseconds of computation per block. If $\alpha = .5$ and
N = 64, then we compute for about 125 milliseconds per block. This is
more time than is required for the rotation of any current large disk,
which is usually in the range of 40 to 60 milliseconds. Thus, if we
can always keep one input request ahead in a disk queuer, it should be
possible to completely mask the I/O request time.

As the ratio of processor speed to disk rotation speed gets
larger, this problem becomes more difficult. Suppose we have a calculation
with the same parameters as above, but we wish to use a processor
which is ten times faster. Then we have only 12 milliseconds of computation
time per block and this is shorter than the rotation time of any large
disk. There are several obvious ways to avoid this problem. One is to
increase N; this may require a larger primary memory. Another is to supply
the disk queuer with several requests, thereby decreasing the expected
time until some request is honored [39]. In some cases there are uniform
but intricate relationships between the data blocks and their processing
sequence. To handle these cases, we can attempt a third solution, namely
the preplanning of block layout on the disk.
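
The arithmetic behind this comparison is easily reproduced. The Python
sketch below assumes $\alpha N^3$ operations per block of $N^2$ words and
treats the I/O request as masked whenever the computation on one block
takes at least one full disk rotation. Both the function names and this
masking criterion are simplifying assumptions made for illustration.

```python
def compute_ms_per_block(alpha, N, op_time_us=1.0):
    """Milliseconds of computation per N*N-word block, assuming alpha * N**3 operations."""
    return alpha * N ** 3 * op_time_us / 1000.0


def io_is_masked(alpha, N, rotation_ms, op_time_us=1.0):
    """True if computing on one block takes at least one disk rotation."""
    return compute_ms_per_block(alpha, N, op_time_us) >= rotation_ms


alpha, N = 0.5, 64
for op_time_us in (1.0, 0.1):            # current machine, then one ten times faster
    ms = compute_ms_per_block(alpha, N, op_time_us)
    print(f"{op_time_us} us/op: {ms:.1f} ms per block,",
          "masks a 60 ms rotation:", io_is_masked(alpha, N, 60.0, op_time_us))
```

For $\alpha = .5$ and N = 64 this gives roughly 131 ms per block at one
microsecond per operation and roughly 13 ms on a processor ten times
faster, the same order as the figures of 125 and 12 milliseconds quoted
above. Only the first exceeds a 40 to 60 millisecond rotation time.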

Consider the problem of matrix multiplication using a head per track disk.
Suppose that both operand matrices are partitioned into square blocks, that
the premultiplier is stored by rows of partitions, and that the
postmultiplier is stored by columns of partitions. Let us also assume that
the angle on the disk between the positions of successive partitions
represents the disk motion time equal to the processor time required to
multiply two square partitions. Now if it happens that one row (and column)
of partitions ends just where the next starts, then it is clear that such a
disk storage scheme allows matrix multiplication with no CPU time lost due
to waiting for data from the disk. It is also clear that if a sequence of
matrix operations is required, then the preplanning of the disk layout
becomes more complex. In general, some I/O wait time will be required of
the CPU. However, in order to use any matrix as a premultiplier or
postmultiplier, it is possible to store all matrices in such a way that
they may be fetched by row partitions or column partitions. This is
achieved by storing the second partition of the first row, say $A_{12}$, in
the same relative position on the disk as the first partition in the second
row, say $A_{21}$. This skewing pattern may be continued in the obvious way,
given a sufficient number of disk surfaces. Matrix inversion and eigenvalue
calculations require much more intricate disk storage schemes, but the
problems are similar [91].

A somewhat more difficult set of constraints is encountered in
some problems, e.g., explicit partial differential equation methods. In
these cases it is necessary to sweep through an array of data repeatedly.
When any partition of the array is being processed, it is necessary also
to have some data elements from neighboring partitions. For example, if
a five point finite difference operator is being applied to M element
partitions of an array, then $\sqrt{M}$ border elements are required from each
of the four adjacent partitions. It should be possible to pack these
border elements in separate arrays, then write and read them on and off
the disk at appropriate times. Assume the calculation on an M element
partition requires time $T_c$. Next assume it is possible to map partitions
of the array onto the disk such that the one-way transmission time for a
partition is $(T_c - \epsilon)/2$. Now we can read a new block and write an old
block in $2\,(T_c - \epsilon)/2 = T_c - \epsilon$. If the edge values of the
neighboring blocks can be transmitted in and out to the disk in $\epsilon$ time
units, then the scheme maintains a steady state balance between computation
time and I/O transmission time.

A somewhat weakened set of conditions is imposed in Bernott [15],
where it is assumed that $T_c$ is not less than five times the one-way
transmission time for a block. Various depths of finite difference operators
and any rectangular mesh are allowed. Also, the number of variables being
computed is a parameter. In terms of several latency considerations and
the above mentioned parameters, a disk layout is computed which gives a
resulting computation scheme that has an overall expected CPU efficiency
greater than 80%.


V.   Summary  and  Extensions 

As computer systems become more complex and as users' requirements
become more specialized, the computer system designer must give
more attention to overall system cost performance when he designs each
part of the system. In other words, he must study more and more
tradeoffs between various parts of the system.

In this paper we have discussed some interrelations between
system parameters including: primary memory size, page size, secondary
memory speed, I/O request queuers, and the number of jobs multiprogrammed.
These together with user program parameters including: mean time to access
p pages, number of instructions executed per datum and regularity of
addressing a data structure have a major influence on the CPU efficiency.

We limited our discussion to two-level memory hierarchies, but
the techniques mentioned can be applied to more levels by lumping several
levels and reducing the problem to one of two levels. This requires
approximating the parameters of a lumped level using the parameters of the
levels being combined. The use of a two-level primary memory is quite
successful in the IBM 360/85 [66]. It is also common to use a fast drum
between primary memory and a slow disk [34]. Machines which operate on arrays
of data and are organized as arrays of arithmetic processes are now being
designed. For example, the pipeline processors [124] (which might be called
serial array processors) and ILLIAC IV [10] (which might be called a
parallel array processor) have many individual memory units, and this fact
makes it necessary to carefully plan the layout of data in primary memory for
maximum CPU utilization. The kinds of storage planning discussed below
might be regarded either as minimizing the number of data faults or the
time per data fault because the question is that of supplying data to the
processor from the primary memory at a maximum rate.

Serial array processors generally require a memory whose effective
cycle time is equal to the CPU clock time. This is achieved by
interleaving many slower memory units in a large bank. Since, in general, two
vectors are entering the processor and one is emerging, it is convenient
if at least three such banks are available. Clearly, serious memory
conflicts can arise in this situation. If two argument vectors are stored
in the same bank, the processing speed may be cut in half.




Since present serial array processors reach a speed limit due
to the fact that the pipeline length can be made no longer than the number
of elementary steps in an arithmetic operation, parallel array processors
seem to be a logical necessity for more speed improvement. The memory
system of ILLIAC IV consists of one memory unit per processor. Each
memory unit is directly accessible by just one processor. A network of
routing logic may be used to get data to other processors. If one-dimensional
arrays are stored with one element per processor, then the full speedup
over a single processor may be achieved. In two-dimensional arrays, row
operations are easy to perform with a straightforward mapping of an array
into the memory, e.g., rows are stored across the processors and each
column is within a processor. Similarly, column operations are easy with
a transposed array. However, if both row and column operations are required
with such a storage scheme using an n processor machine, then operations
in one direction will realize an n-fold speedup but operations in the other
direction will realize no speedup at all over a one processor machine. If
row and column operations are required, some kind of skewing scheme as
outlined in Section IV will provide the full speedup [90]. It may be expected
that in the future, parallel arrays of pipeline processors will require even
more intricate primary storage mapping schemes.
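
The difference between the straightforward and the skewed mapping can be
made concrete by counting how many distinct memory units hold the elements
of one row and of one column. The Python sketch below does so for an n x n
array on n memory units. The particular skewing rule used, unit
(i + j) mod n for element (i, j), is an assumption chosen for illustration
and is not necessarily the scheme of [90]; the function names are likewise
hypothetical.

```python
def straightforward_unit(i, j, n):
    """Rows stored across the units: element (i, j) is in unit j."""
    return j % n


def skewed_unit(i, j, n):
    """Row i rotated by i places: element (i, j) is in unit (i + j) mod n."""
    return (i + j) % n


def units_used(cells, unit_of, n):
    """Number of distinct memory units holding the given elements."""
    return len({unit_of(i, j, n) for (i, j) in cells})


n = 8                                        # hypothetical number of processors
row = [(2, j) for j in range(n)]             # one row of an n x n array
col = [(i, 5) for i in range(n)]             # one column

for name, unit_of in (("straightforward", straightforward_unit),
                      ("skewed", skewed_unit)):
    print(name, "row:", units_used(row, unit_of, n),
          "column:", units_used(col, unit_of, n))
```

An operation on n elements can proceed in all processors at once only if
the elements lie in n different units. Under the straightforward mapping a
column lies entirely in one unit, whereas under the skewed mapping both a
row and a column are spread over all n units.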

It should be remembered that we have been discussing just one
underlying subject throughout this paper: the ratio of cost to performance
for an overall computer system. We have attempted to relate several memory
parameters and program characteristics to the system performance as
measured by CPU utilization.

LIST OF FOOTNOTES

1.  Note  we  always  measure  time  in  instruction  executions;  i.e.,  we  scale 
time  by  the  average  instruction  time. 

2.  The results of these experiments consisted of 1737 execution bursts
from 102 service intervals for five programs: 1) LISP, 2) an
interpretive meta compiler, 3) an interpretive, initially interactive,
display generation system, 4) an interactive JOVIAL compiler,
and 5) a concordance generation and reformatting program. Page size
was 1024 words.

3.  This corresponds to imposing a variable q on the program. Smith [132]
indicates this q had a hyperexponential distribution,
w_1 e^{-t/[illegible]} + w_2 e^{-t/(40.7 x 10^3)}.

4.  See Denning [40, 41].

5.  We assume that the first page is referenced at t = 0 with probability
1 (t_1 = 0), which accounts for the difference between this formula and
that of Shemer and Shippey.

6.  Determined from a least-squares fit to the function ln t_p = a + γ ln p,
where δ = e^a. Average error over 18 points was 16%.

It should be remembered that values of α and β are characteristics
of a given program or class of programs, and should not be used
to describe all programs. A similar study of results [135] from a
SNOBOL compiler yielded φ(p) = .54 p^[exponent illegible].

7.  Belady and Kuehner [12] suggest the function φ(p) = αp^β in their
paper, although they fail to state their reasons.

8.  In the general Poisson process, the average number of pages referenced
in time t is not f^{-1}(t), where t = f(p). However, we are using this
only as an approximation.

9.  The following definitions follow Randell [113].

10.  Randell's experiments consisted of generating storage requests in the
form of a segment size and time to be spent in memory. Time in memory
was always generated from a uniform distribution of 1, 2, 3, 4, 5. Segment
size was generated from several distributions. B was 1024 and Q was
varied from 32 to 1024 in powers of 2. Total memory size was 32K.

It  was  assumed  that  requests  for  memory  were  always  waiting  to  be 
filled. 

11.  For Q = B/32, utilization varied from over 95% for s = 4B to about
90% for s = B/2. At Q = B, utilization varied from just under 90%
for s = 4B to about h% at s = B/2.

12.  Until  stated  otherwise,  we  now  assume  b  =  Q  =  B,  i.e.,  page  size  is 
constant  over  a  given  experiment. 

13.  This data comes from two program loads: 1) "10 small FORTRAN compilations
and loads" and 2) "FORTRAN compilations, and executions, used
to debug the 44x FORTRAN compiler." Apparently, there is negligible
internal and external fragmentation in this experiment.

14.   This  data  is  from  an  integer  programming/calculation. 

15.  Since apparently M(1) < a_0 + a_1.

16.  Again in this and the following experiment, there is apparently negligible
fragmentation.

17.  See Rosene [119].

18.  Belady's results are based on the simulated execution of 11,000,000
instructions  of  an  integer  programming  code  written  in  FORTRAN. 

19.  Gaver's model considers I channels with both I < J and I > J. We
will only consider the case where I > J; i.e., there are no conflicts
for secondary memory.

The assumption of an exponential distribution of I/O completion
time  is  not  particularly  realistic  as  Gaver  admits.  Since  we  are  using 
T  to  represent  the  average  time  required  to  complete  all  kinds  of  I/O 
requests,  paged  or  explicit,  the  density  of  T  will  probably  consist  of 
a  collection  of  exponential,  Gaussian,  and  delta  functions.  However, 
even  with  a  simple  exponential  distribution,  the  total  expectation 
functions  become  quite  complex,  and  a  more  complex  distribution  would 
not  be  warranted  here.   See  Smith  [132]  for  a  slightly  different  model. 

20.  Pages are here assumed fixed at 1024 words.

21.  Actually,  this  could  only  be  true  if  M  were  some  multiple  of  J.  However, 
if M ≫ J, this is not a bad approximation. We also assume here that
programs  are  not  swapped  out  of  primary  memory  while  waiting  for  I/O. 

22.  See Frank [61] for an analysis of the statistical properties of disk
systems. 

23.  Our development in this section will follow Denning [39]. See also
[26, 132, 139, 140].

24.  The particular case of Gaver's model which we used in Section III
assumed no conflicts for secondary memory, i.e., rate of I/O completion
was not dependent on the number of jobs (requests). The techniques
discussed here are not as good as those assumed in Section III.
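
As an illustration of the least-squares fit mentioned in the footnotes above, a power law t = δ p^γ becomes a straight line in logarithms, ln t = ln δ + γ ln p, so ordinary linear least squares on (ln p, ln t) recovers both constants. The sketch below is only that: the function name and the synthetic data are hypothetical and are not the measurements used in this report. Points with t = 0 must be left out, since the logarithm is undefined there.

```python
import math

def fit_power_law(ps, ts):
    """Least-squares fit of ln t = a + gamma * ln p.

    Returns (a, gamma, delta) with delta = exp(a), so that t ~ delta * p**gamma.
    """
    xs = [math.log(p) for p in ps]
    ys = [math.log(t) for t in ts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx              # slope of the log-log line
    a = mean_y - gamma * mean_x    # intercept of the log-log line
    return a, gamma, math.exp(a)

if __name__ == "__main__":
    # Hypothetical data: 18 points generated from t = 3 * p**2 (not from the report).
    ps = list(range(2, 20))
    ts = [3.0 * p ** 2 for p in ps]
    a, gamma, delta = fit_power_law(ps, ts)
    print(a, gamma, delta)
```

Fitting in logarithms minimizes relative rather than absolute error, which is one reason an average error for such a fit is naturally quoted as a percentage.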